This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

The idea here is to use exploratory data analysis in order to be able to show faulty borrowers in a better light, and be able to distinguish some or their most remarkable characteristics apart from those who do succeed to keep up on their payments, and finish.

By this data, does employment status at the time of listing have any visible correlation with loan status?

It is important to mention that the dataset distinguishes between borrowers that are ‘Employed’, ‘Full time’ and ‘Part time’; this leads one to believe that this data maybe faulty since one category really is a set made up by the other two.

Univariate Plots

Here, it is easy to notice that the greater bulk of borrowers are made up of those who were employed at the time of loan listing. this group is made up of ‘Employed’, ‘Full-time’, ‘Self-employed’ in that order. We can also see that there are too many field names to fit on this small graph frame. we can use the theme function to tilt these labels, and the fill aesthetic for coloring, like so:

We notice here that there is an extra column in our histogram with no label. so we use the unique function to create a vector, named “c”, which will contain all the unique labels of our rows.

You can see that there is a graph column with no name (or “”). to get rid of this we save a subset of our initial data set to a variable, without these rows with values.

Now, let’s Graph a few other variables

##       EmploymentStatus
##  Employed     :67322  
##  Full-time    :24872  
##  Self-employed: 6029  
##  Not available: 2959  
##  Other        : 2474  
##  Part-time    : 1002  
##  (Other)      : 1563

In our dataset our employed borrowers take the lead as the biggest group of borrowers of all emplyment status groups. The interesting part is that the “Employed” status should be logically be composed of subcategories, like “Full-time”, “Other”, “Part-time” or “Self-employed”; instead it is accounted for as a whole different state.

grid.arrange(ggplot(aes(x=LoanOriginalAmount),
                    data = csv_file.reassesedEmp)+
             geom_histogram(bins = 30),
             ggplot(aes(x=1, y=LoanOriginalAmount), 
                    data = csv_file.reassesedEmp)+
             geom_boxplot(), nrow=1)

summary(csv_file.reassesedEmp["LoanOriginalAmount"])
##  LoanOriginalAmount
##  Min.   : 1000     
##  1st Qu.: 4000     
##  Median : 7000     
##  Mean   : 8527     
##  3rd Qu.:12000     
##  Max.   :35000

It is interesting to see the outliers in this distribution. It makes one wonder as to why these $150,000 loans are so popular in the data. Aside from the outliers a right tailed destribution can be perceived, where the majority of loan values sit at the lower dollar-value end.

qplot(CreditScoreRangeLower, bins = 35, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

summary(csv_file.reassesedEmp["CreditScoreRangeLower"])
##  CreditScoreRangeLower
##  Min.   :  0.0        
##  1st Qu.:660.0        
##  Median :680.0        
##  Mean   :689.5        
##  3rd Qu.:720.0        
##  Max.   :880.0

Here we get a notion of the worst recorded credit health of the borrowers. It is nice to see that the bulk of the scores sit in between 625~750, which can be considered fair to excellent.

qplot(CreditScoreRangeUpper, bins = 35, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

summary(csv_file.reassesedEmp["CreditScoreRangeUpper"])
##  CreditScoreRangeUpper
##  Min.   : 19.0        
##  1st Qu.:679.0        
##  Median :699.0        
##  Mean   :708.5        
##  3rd Qu.:739.0        
##  Max.   :899.0

Once again we notice the bulk-majority of our scores at the fair-to-excellent range, this time peeking towards the higher scores It’s nice to see such a healthy consistency among loaners’ scores.

qplot(CreditScoreRange, bins = 35, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

summary(csv_file.reassesedEmp["CreditScoreRange"])
##  CreditScoreRange
##  Min.   :19      
##  1st Qu.:19      
##  Median :19      
##  Mean   :19      
##  3rd Qu.:19      
##  Max.   :19

This is a true oddity, how is it possible tha all of our borrowers credit scores have only varied by 19 points the entire lifetime before withdrawing a loan? Is this a mistake? if so, was it machine-made or man-made? was it intentional? (After the 2008 financial crisis, official investigations uncovered evidence of banks coaching or guiding soon-to-be borrower to answer their questionaires to be more likely to be approved)

qplot(BorrowerState, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

summary(csv_file.reassesedEmp["BorrowerState"])
##  BorrowerState  
##  CA     :14406  
##  TX     : 6617  
##  NY     : 6591  
##  FL     : 6583  
##  IL     : 5840  
##  GA     : 4853  
##  (Other):61331

It seems as if more populars states to live in received the greatest amounts of loans. This notion may make sense since the more people that live in a state, the more people to ask for loans there are, with California taking the lead on this one by over twice the amount of loans of the next state in line, Texas.

qplot(Occupation, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

summary(csv_file.reassesedEmp["Occupation"])
##                     Occupation   
##  Other                   :27487  
##  Professional            :13122  
##  Computer Programmer     : 4243  
##  Executive               : 4189  
##  Teacher                 : 3649  
##  Administrative Assistant: 3547  
##  (Other)                 :49984

Since there ae a lot more professions other than the ones listed here, it may be understandable that most loans were given to borrowers with “Other” occupations. The next category, “Professional”, may be a subjective title, which once again brings me to the idea that there is a presence of fraud in this data. Now, in third place we programmers. Not surprising since we are currently living in the information era.

qplot(LoanStatus, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

summary(csv_file.reassesedEmp["LoanStatus"])
##                  LoanStatus   
##  Current              :55272  
##  Completed            :33768  
##  Chargedoff           :10833  
##  Defaulted            : 4091  
##  Past Due (1-15 days) :  798  
##  Past Due (31-60 days):  358  
##  (Other)              : 1101

It is encouraging to see that the majority of loans are Completed or current. However, the presence of fraud is still looming over all of the current loans, the amount of loans defaulted and charged-off.

The fill aesthetic can be used to add color difference among levels of each factor variable on their respective graphs, but it wasn’t used here for the sake of simplicity, it will be used later on.

Univariate Analysis

What is the structure of your dataset?

My dataset is made up of 26 of the 81 total variables of the original dataset. I cut them short in the interest of time and memory space. Although more data is always better it seemed justifiable to cut this dataset to investigate a few variables at first since the discoveries in relations among so many variables could very well take a great amount of time to understand.

What is/are the main feature(s) of interest in your dataset?

My main features of interests are LoanStatus and EmploymentStatus, in order to find the best signs of borrowers defaulting on their payments, since logic follows that you will not have money to pay a loan if you do not have a job, which is most people’s main source of income. Other features of interest may be: the type of job(Occupation) a borrower has, since different careers can make different amounts of money.Loan amount (LoanOriginalAmount) since it is harder to pay a bigger loan than it is a small one. Also, credit score, since it is a score system based on previous information of a borrower’s likely of paying previous debt.

Did you create any new variables from existing variables in the dataset?

I created a new variable called CreditScoreRange which contains the range between the maximum credit score, and minimum credit score. I believe that there is a negative correlation between a borrower’s likelyhood to pay off their loan and the magnitude of this number.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

As we move through these variables and graph their count per level, we noticed that certain categories in each graph are more likely to posess a higher count than others. For example, the employment graph demonstrates that most loans are given to those who were employed at the time the borrowers asked for the loan. The CreditScoreRangeLowre and CreditScoreRangeUpper categories show that most loan pertain to those borrower with credit scores roughly in the range of 500 to 750. the BorrowerState shows that California is the state with most borrowed loans. In the occupation variable graph we see that there is a high quantity of borrowers who wrote down ‘Other’ as their occupation, and the most popular loan Statuses are ‘Chargedoff’, ‘Completed’ and ‘Current’, as demonstrated in the LoanStatus graph.

Bivariate Plots

We can also use the fill aesthetic in order to distinguish among employment statuses in the count vs LoanStatus graph:

## Warning: Ignoring unknown parameters: binwidth, bins, pad

At a quick glance, it is noticble how most of these loan-status categories are ruled by a greater majority of employed borrowers, except by “Completed” and “Chargedoff” categories, which is somehow a contradictory.

Let’s take a look at those groups that have not comply with their payment deadlines. These include: “Defaulted” and all variations of “Past Due”.

##                    Employed     Full-time Not available  Not employed 
##             0           675            28             0             9 
##         Other     Part-time       Retired Self-employed 
##            38             2             2            44

Once more, due to the ambiguity of the description “Employed” we cannot rely on this information completely. “Employed” should be a column of its own containing True and False arguments, with the other categories in this data column being in a different one. However, assuming that “Employed” encapsules other values, such as:

“Full-time” “Part-time” “Self-employed”

then it is surprising that they dominate the over this loan status categories, given that such categories imply an income with which a loan is payed.

ggplot(aes(x=LoanStatus,
           fill=factor(EmploymentStatus)),
       data = subset(csv_file.reassesedEmp,
                     LoanStatus=='Past Due (16-30 days)'))+
  geom_bar(position='dodge')+
  scale_y_log10()

summary(subset(csv_file.reassesedEmp$EmploymentStatus,
               csv_file.reassesedEmp$LoanStatus ==
               "Past Due (16-30 days)"))
##                    Employed     Full-time Not available  Not employed 
##             0           226             7             0             5 
##         Other     Part-time       Retired Self-employed 
##            12             1             0            13

The same arguments as the above chart’s are valid here, since distribution of emplyment status accross loan status are eerily similar; with the exception of full-time employed borrowers being relatively smaller in this instance.

ggplot(aes(x=LoanStatus,
           fill=factor(EmploymentStatus)),
       data = subset(csv_file.reassesedEmp,
                     LoanStatus=='Past Due (31-60 days)'))+
  geom_bar(position='dodge')+
  scale_y_log10()

summary(subset(csv_file.reassesedEmp$EmploymentStatus,
               csv_file.reassesedEmp$LoanStatus ==
               "Past Due (31-60 days)"))
##                    Employed     Full-time Not available  Not employed 
##             0           293            13             0             3 
##         Other     Part-time       Retired Self-employed 
##            23             2             2            22

Once more, the pattern repeats, the relative sizes of employed borrower categories do not differ much as we ascend in the length of time delay in loan payments. This again keeps the presence of fraud afloat in this data. Why would borrowers who have jobs be late on their payments?

The next couple of graphs are too similar to these past ones therefore there is not a lot of new information we could extract from them:

ggplot(aes(x=LoanStatus,
           fill=factor(EmploymentStatus)),
       data = subset(csv_file.reassesedEmp,
                     LoanStatus=='Past Due (61-90 days)'))+
  geom_bar(position='dodge')+
  scale_y_log10()

summary(subset(csv_file.reassesedEmp$EmploymentStatus,
               csv_file.reassesedEmp$LoanStatus ==
               "Past Due (61-90 days)"))
##                    Employed     Full-time Not available  Not employed 
##             0           262            11             0             6 
##         Other     Part-time       Retired Self-employed 
##            13             1             1            19
ggplot(aes(x=LoanStatus,
           fill=factor(EmploymentStatus)),
       data = subset(csv_file.reassesedEmp,
                     LoanStatus=='Past Due (91-120 days)'))+
  geom_bar(position='dodge')+
  scale_y_log10()

summary(subset(csv_file.reassesedEmp$EmploymentStatus,
               csv_file.reassesedEmp$LoanStatus ==
               "Past Due (91-120 days)"))
##                    Employed     Full-time Not available  Not employed 
##             0           255            12             0             3 
##         Other     Part-time       Retired Self-employed 
##            18             2             1            13
ggplot(aes(x=LoanStatus,
           fill=factor(EmploymentStatus)),
       data = subset(csv_file.reassesedEmp,
                     LoanStatus=='Past Due (>120 days)'))+
  geom_bar(position='dodge')+
  scale_y_log10()

summary(subset(csv_file.reassesedEmp$EmploymentStatus,
               csv_file.reassesedEmp$LoanStatus ==
               "Past Due (>120 days)"))
##                    Employed     Full-time Not available  Not employed 
##             0            14             0             0             0 
##         Other     Part-time       Retired Self-employed 
##             0             0             0             2

Ultimately, it is surprising to see that these two categories of borrower are here. Even thou they are small in quantity compared to previous loan status categories, it is still concerning that they are present at all.

ggplot(aes(x=LoanStatus,
           fill=factor(EmploymentStatus)),
       data = subset(csv_file.reassesedEmp,
                     LoanStatus=='Defaulted'))+
  geom_bar(position='dodge')+
  scale_y_log10()

summary(subset(csv_file.reassesedEmp$EmploymentStatus,
               csv_file.reassesedEmp$LoanStatus ==
               "Defaulted"))
##                    Employed     Full-time Not available  Not employed 
##             0           630          2050           985            25 
##         Other     Part-time       Retired Self-employed 
##            54            56            54           237

We can see that Employed borrowers make up the majority of all “Past Due” bars. Whereas “Full-time” and “Not Available” make up the majority of the “Defaulted” category.

Let’s use the same strategy to contrast more than one supporting variable at a time. Here we did not use all of the variables that we graphed previously since some of these variable are continuos, integer data.

a <-sort(table(csv_file.reassesedEmp$BorrowerState),decreasing=TRUE)[1:10]

These are the top ten states with the most amount of loans.

##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##     0     0     0     0     0 14406     0     0     0     0  6583  4853 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##     0     0     0  5840     0     0     0     0     0     0     0  3477 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##     0     0     0     0     0     0     0     0  3053     0     0  6591 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4139     0     0     0     0     0     0     0  6617     0  3233     0 
##    WA    WI    WV    WY 
##     0     0     0     0

Although differen in size, the spread of the top 10 borrowing state seems to be consistentamong loan statuses, i.e. the states seem to have relative equal varieties of loan statuses as each other.

Bivariate analysis a very simple way to find the spread of categories, like those of “EmploymentStatus” accross the body of another variable’s magnitude, LoanStatus vs Count.

We could also employ ggplot’s facet_wrap in order to make different graphs of any categorical variable vs Count accross all levels of “LoanStatus”

I personally really enjoy using facet wrap since it is both an efficient and visually complex graphing tool. Here, we better compare different loan statuses side by side in terms of different employment stauses.

qplot(LoanOriginalAmount,
      fill = LoanStatus,
      bins = 35, data = csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
  scale_y_log10()+
  facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 175 rows containing missing values (geom_bar).

ggplot(aes(x=LoanStatus,
           y=LoanOriginalAmount,
           fill=LoanStatus),
       data=csv_file.reassesedEmp)+
  scale_y_log10()+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))+
  geom_violin()

Here we examine the spread of instances of each loan amount amongst all possible loan status. We can see that the Status such as “Chargedoff”, “Completed” and “Defaulted” have a greater variance in loan amounts. Whereas the others, except for “Cancelled”, hang at the top of the spectrum with greater original amount values.

qplot(CreditScoreRangeLower, fill = LoanStatus, bins = 35, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
  scale_y_log10()+
  facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 283 rows containing missing values (geom_bar).

In this graph there is more consistency across one of our continuous variables, leading to better shaped distributions of instances of our loans. In other words, our number of loans accross any loan status is more symmetrically distributed along our “CreditScoreRangeLower” variable then it was by “LoanAmount”. However, it is interesting to see that in the category graphs “Current” and “Completed” the bulk of loan count amounts, in terms of the lower range of borrower credit scores, peeks on the lower credit score side. In these two categorie, being on the lower side made someone more likely to get a loan, according to this info.

qplot(CreditScoreRangeUpper, fill = LoanStatus, bins = 35, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
  scale_y_log10()+
  facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 289 rows containing missing values (geom_bar).

As in the graph above, we see the same borrower pattern. We see that more loans were granted at the lower end of the “CreditScoreRangeUpper” range. Which again shows a relation with between more loans as credit score lowers. Let’s keep this in mind as we look for more signs that may point towards fraud as a possible variable.

qplot(CreditScoreRange, fill = LoanStatus, bins = 35, data =
        csv_file.reassesedEmp)+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
  scale_y_log10()+
  facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 408 rows containing missing values (geom_bar).

This section really grabs should grab the readers attaention since we are finding that all possible borrowers either had no difference in value of their credit score (possibly indicating that these borrower, in their entire credit history have never been able to improve nor worsen their credit score) or that all other borrowers have all only affected their credit score by, exclusively +/- 19 points. These two sceneraios should be unlikely since credit scores are not static, iven a long enough amount of time, and definitely not change only by a constant value of 19.

Here we get a more discrete representation of each supporting variables’ description against the LoanStatus. It is easier to distinguish each column of count per supporting variable’s level against the grouping by LoanStatus’ levels than it was before. The fill aesthetic argument was introduced to make the bars more distinguishable.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
### investigation. How did the feature(s) of interest vary with other features in
### the dataset?

By simultaneously contrasting two supporting variables we can see their relationship with one another. When we analyzed individual columns of the LoanStatus vs EmploymentStatus we noticed that the main body of all “Past Due” variations are made up of employed borrower, and the “Defaulted” category is made up mainly of borrower who marke their employment status as “Full-time” and “Not Available”.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

Yes, in the brainstorm section at the end of this report I visualized the relations between variables ‘CreditScoreRangeLower vs BorrowerState’ and ‘CreditScoreRangeUpper vs BorrowerState’. The observation I made was the similiarites in variance of each of these credit ranges of each state. That is to say that the variance in upper and lower credit scores were almost the same, per state.

What was the strongest relationship you found?

Fast forward to the part we use ggplot and facet_wrap, we can see here that is easier to spot any spike in count of loans according to their loan status, but also contrasted against any of the other variables, and their respective levels. For example, in Count vs EmploymentStatus facet_wrapped around Loanstatus we can see that there are major spike for employed and fulltime borrowers under the categories of chargedoff, completed and current. Another example are the spikes in the Count vs Ocupation vs LoanStatus graph; these spikes represent the high count of borrowers under the “Other” Occupation status, in the “Chargedoff”, “Current” and “Completed” categories of LoanStatus.

OVerall, these Bivariate graphing methods have served to better interpret the hidden relationships between variables. By classifying Data points further the addition of more discriminating factors we have brought a better understanding in our EDA rough draft, which may lead one to believe that consecutive addition of other variables will help understand this data set in new better ways.

Multivariate Plots

Since our subjects of interest are those which are past due on their payments, or have defaulted on their loans then we will go ahead and limit our data set to borrowers who fall under these categories.

csv_file.reassesedLS <- 
  csv_file.reassesedEmp[which(
    csv_file.reassesedEmp$LoanStatus == 'Past Due (1-15 days)'|
    csv_file.reassesedEmp$LoanStatus =='Past Due (16-30 days)'|
    csv_file.reassesedEmp$LoanStatus =='Past Due (31-60 days)'|
    csv_file.reassesedEmp$LoanStatus =='Past Due (61-90 days)'|
    csv_file.reassesedEmp$LoanStatus =='Past Due (91-120 days)'|
    csv_file.reassesedEmp$LoanStatus =='Past Due (>120 days)'|
    csv_file.reassesedEmp$LoanStatus =='Defaulted'),]

We should notice the count reduction of borrowers from 106221 to 6144.

Let’s make sure that only those levels of LoanStatus we are interested in have trickle down into our new dataset

unique(csv_file.reassesedLS$LoanStatus)
## [1] Past Due (1-15 days)   Defaulted              Past Due (16-30 days) 
## [4] Past Due (61-90 days)  Past Due (31-60 days)  Past Due (91-120 days)
## [7] Past Due (>120 days)  
## 12 Levels: Cancelled Chargedoff Completed Current ... Past Due (91-120 days)

Loan Status and Employment:

Before we had notice that borrowers with a ‘Employed’ employment status made up most of the mass of past due loans (including all of its variations in past due time), as opposed to other employment statuses. We will also look at the body of employed borrowers with ‘Defaulted’ loans. Now let’s take this information and contrast it against other variables.

Let’s reassess our data set again to narrow it down to those employed borrowers that fall under this Loan statuses.

csv_file.reassesedLSandEmp <- subset(csv_file.reassesedLS,
                                     csv_file.reassesedLS$EmploymentStatus ==
                                       'Employed')

Since we have narrowed our dataset further down into loan listings of employed borrowers whose loan status is not favorable we will now investigate the relations of this new set to the other variables. Here we will make a boxplot of our loan amounts. We do so in order to look at the variability of loan amounts according to any given categorical variable that fall under ‘Defaulted’ and all ‘Past Due’ variations of status.Remember, these are all employed borrowers.

Let’s reorganize our variables for a better visualization of our data:

csv_file.drops <- csv_file.reassesedLSandEmp[,c("LoanStatus",
                                                "IncomeRange",
                                                "IsBorrowerHomeowner",
                                                "IncomeVerifiable",
                                                "EmploymentStatus",
                                                "FirstRecordedCreditLine",
                                                "EmploymentStatusDuration",
                                                "OpenCreditLines",
                                                "DelinquenciesLast7Years",
                                                "CreditScoreRange",
                                                "CreditScoreRangeLower",
                                                "CreditScoreRangeUpper",
                                                "TotalCreditLinespast7years",
                                                "TotalTrades",
                                                "CurrentDelinquencies",
                                                "TradesOpenedLast6Months",
                                                "StatedMonthlyIncome",
                                                "CurrentCreditLines",
                                                "AmountDelinquent",
                                                "DebtToIncomeRatio",
                                                "LoanOriginalAmount")]

GGally’s ggduo

In this case, ggduo can narrow down some work, this is because it can graph one variable of our dataset against all others simultanously. Here, we are comparing “LoanStatus” against the rest of the variables in separate chunks for better visualization.

drops <- c('Term',
           'ListingKey',
           'Occupation',
           'BorrowerState',
           'IsBorrowerHomeowner',
           'EmploymentStatus',
           'IncomeRange',
           'IncomeVerifiable')

csv_file.drops <- csv_file.drops[,!(names(csv_file.drops) %in% drops)]

ncol(csv_file.drops)
## [1] 17
ggduo(csv_file.drops, 1, 2:17)
## Warning: Removed 85 rows containing non-finite values (stat_boxplot).

This graph is too dense to be able to see it clearly. Let’s divide this group of graphs into multiple for a better look at the graphed data.

ggduo(csv_file.drops, 1, 2)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$FirstRecordedCreditLine)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1958-01-01" "1991-04-27" "1996-08-06" "1995-05-29" "2000-05-24" 
##         Max. 
## "2011-04-19"

There doesn’t seem to be a lot of relative variation in “FirstRecordedCreditLine” across time, in terms of these loan statuses; with the exception of “Past Due (>120 days)” (We previously noticed that this category is a very small subset of our data). Given this, we can better the similarities between faulty loan status through in terms of this new variable.

ggduo(csv_file.drops, 1, 3)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$EmploymentStatusDuration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   28.00   69.00   94.99  135.50  527.00

“EmploymentStatusDuration” works as another aspect employed faulty-borrowers of differnt loan status categories have in common, with minor variations.

ggduo(csv_file.drops, 1, 4)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$OpenCreditLines)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.000   8.000   9.008  12.000  33.000

We should make a remark about this graph on how the majority of employed borrowers, per loan status category is centers towards the lower side of the “OpenCreditLine” domain, with a few instances of outliers possesing more than 20 open credit lines at the time of loan acceptance.

ggduo(csv_file.drops, 1, 5)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$DelinquenciesLast7Years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.409   3.000  89.000

The amount of delinquencies outliers per Loan status (aside from “Past Due (>120 days)”) is remarkable in this graph, as oppososed to previous ones. Specially since delinquencies sound like they could be an important factor to take into account when considering accepting a borrower’s loan request.

ggduo(csv_file.drops, 1, 7)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$CreditScoreRange)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      19      19      19      19      19      19

We previously noted this odd coincidence. This an important fact to keep in mind for later when the in our multivariate analysis and reflection

ggduo(csv_file.drops, 1, 8)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$CreditScoreRangeLower)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   680.0   689.8   720.0   860.0

“CreditScoreRangeLower” seems to be one of our most consistant variables to graph. It seems as employed borrowers belonging to the different, faulty loan status categories share very similar low-credit-score ranges.

ggduo(csv_file.drops, 1, 9)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$CreditScoreRangeUpper)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   619.0   679.0   699.0   708.8   739.0   879.0

Since we know that the variation in credit score ranges per borrower can only vary by +/-19 point or none, it only makes sense that this graph is the same as previous where most values were just increased by 19 points

ggduo(csv_file.drops, 1, 10)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$TotalCreditLinespast7years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   18.00   26.00   27.72   35.00   97.00

“TotalCreditLinespast7years” is another variable that really impressed me. The fact more than half of all borrowers, each posses about 20 open credit lines seems counterintuitive in my opinion. Shouldn’t someone with these many open lines of credit be considered a liability?

ggduo(csv_file.drops, 1, 11)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$TotalTrades)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   22.00   23.33   30.00   75.00
ggduo(csv_file.drops, 1, 12)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$CurrentDelinquencies)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4841  0.0000 17.0000
ggduo(csv_file.drops, 1, 13)+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

summary(csv_file.drops$TradesOpenedLast6Months)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.9359  1.0000  8.0000

Furthermore, we can better describe the graphed ranges of each:

csv_file.drops <- csv_file.reassesedLSandEmp[,c("LoanStatus",
                  "IncomeRange",
                  "IsBorrowerHomeowner",
                  "IncomeVerifiable",
                  "EmploymentStatus",
                  "FirstRecordedCreditLine",
                  "EmploymentStatusDuration",
                  "OpenCreditLines",
                  "DelinquenciesLast7Years",
                  "CreditScoreRange",
                  "CreditScoreRangeLower",
                  "CreditScoreRangeUpper",
                  "TotalCreditLinespast7years",
                  "TotalTrades",
                  "CurrentDelinquencies",
                  "TradesOpenedLast6Months",
                  "StatedMonthlyIncome",
                  "CurrentCreditLines",
                  "AmountDelinquent",
                  "DebtToIncomeRatio",
                  "LoanOriginalAmount")]

We finally, redefine csv_file.drops to use next some of the columns that we previously got rid off.

Here we will graph 3 varibles at once, controlling for employment status and loan status. We will also graph linear models.

ggplot(aes(x = DelinquenciesLast7Years,
           y = OpenCreditLines,
           color = LoanStatus ),
       data = csv_file.drops)+
  geom_point(alpha = 0.1, size =1)+
  geom_smooth(method = "lm", se = FALSE,size = 0.5)+ #lm = linear model
  scale_color_brewer(type='seq', 
                     palette=7,
                     guide=guide_legend(title='LoanStatus'))

summary(csv_file.drops$OpenCreditLines)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.000   8.000   9.008  12.000  33.000

Unsurprisingly, according to the model, the greater the amount of delinquencies per borrower the lower the amount of Credit Lines these were granted, accross all faulty loan statuses.

ggplot(aes(x = DelinquenciesLast7Years,
           y = StatedMonthlyIncome,
           color =LoanStatus  ),
       data = csv_file.drops)+
  geom_point(alpha = 0.1, size =1)+
  geom_smooth(method = 'lm', se = FALSE, size = 0.5)+
  scale_color_brewer(type = 'seq',
                     palette = 7,
                     guide=guide_legend(title='LoanStatus'))

summary(csv_file.drops$StatedMonthlyIncome)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.25  3167.00  4583.00  5320.00  6585.00 33330.00

Here, there almost seems to be no relation among “StatedMonthIncome” and “DelinqueciesLast7Years” regardless of faulty loan status by employed borrowers.

ggplot(aes(x = StatedMonthlyIncome,
           y = LoanOriginalAmount,
           color = LoanStatus),
        data = csv_file.drops)+
        geom_point(alpha = 0.1, size =1)+
        geom_smooth(method = 'lm', se = FALSE, size = 0.5)+
        scale_color_brewer(type = 'seq',
                           palette = 7,
                           guide=guide_legend(title='LoanStatus'))

In this graphs we notice a few characteristics like the negative trend between open credit lines and delinquencies in the last 7 years among borrowers, with all unfavorable levels of loan status, who are employed. Also, there is a positive trend between stated monthly income and loan original amount, with all unfavorable levels of loan status, for borrowers who are employed.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Let’s take a look at some interaction with a correlation value of higher than 0.3 (Significant):

“AmountDelinquent” and “TotalCreditLinespast7years” | cor 0.41

“StatedMonthlyIncome” and “LoanOriginalAmount” | cor 0.335

“LoanOriginalAmount” vs “CreditScoreRangeUpper(Lower)” | cor 0.323

The correlation between amount delinquent and total credit lines opened in the past 7 years may be so relatively highly correlated since it makes sense that the greater the net-credit (sum of all the given credit in every line) a borrower possess, the more room there is for a borrower to owe, even past a delinquent status.

Were there any interesting or surprising interactions between features?

Although the correlation between “LoanOriginalAmount” vs “CreditScoreRangeUpper(Lower)” and “StatedMonthlyIncome” vs “LoanOriginalAmount” are considered significant, They are not as highly correlated as I expected them to be. At first, under my logic, I suspected that these respectively compared variables would have a correlation of close to 1 since it made sense that the more money you make, the more you can ask for in a loan; By contrast, the lower a borrower’s lowest credit score range, the least likely they are to get more money in their loan. But then I realized that I was not taking the other variables into account at the moment, and that I had over estimated the impact of these two variables on “LoanOriginalAmount”.

Final Plots and Summary

Plot One

Description One

I believe that this is the best plot from our univariable plot section. Since our main variable of interest is the our “LoanStaus” variable, we can use this plot to clealy depict the magnitude of each of it’s levels. We can see that although our unfavorable loan status are relatively uncommon, such loans still exist. From here we can better distinguish the rest of our dataset to learn from these individuals with unfavorable loan statuses.

Plot Two

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Description Two

Here, just like in the last plot we are showcasing the magnitudes of each possible loan status, but we are using the fill aes argument to visually separate the different employment statuses that any borrower could have noted at the time of listing. This addition serves to demonstrate the composition of each of this loan statuses in terms of employment statuses. we can also use the coord_cartesian wrap to zoom in onto a single loan status level at a time (or more), and the scale_y_continous wrap to zoom in vertically_wise.

I was able to singled out each column in this graph, which helped in getting a sense of percetage composition of each one of these columns by employment status. Later on I was able to futher filter my data to investigate that employment status which seem to make part of those Loan statuses of interest.

Plot Three

Description Three

Here we visualize all trends of each level of loan status against the amount of loan original amount vs the delinquencies in the last 7 years of each borrower who is employed. I believe it is important to highlight these relations since they demonstrate the increasing tendency in loan original amounts according to monthly income for all unfavorable loan status, but at different rates.


Reflection

It was an interesting project to work on. It served me in helping me build a strategy with which to approach a data set for explorational purposes. Better than that is the fact that I got the chance to do so with a dataset of the financial sector, which peaks my interest.

Some of the struggles I faced at first were finding a purpose for any information I could extract, and selecting my variables carefully, since this is a large set of data. After that, I struggled with few bugs in Rstudio with the graph image resizing (which ultimately led me to discover the “theme” wrap for the functions qplot an ggplot, and jpeg image saving, although I did not use this one in this project). I also found another way to apply a series of conditional filters to a whole dataset, which then led me to discover the “which”" argument in the multivariable plot section.

The biggest surprise to me was the plotted the composition of loans per loan status, in terms of employment status in the bivariate plot section. It is clearly visible that those unfavorable status loans were mostly made up of employed borrowers. Although a “past due” does not necessarily imply that a loan will be defaulted on, it does raise question as to why those who are employ tend be the majority of the crowd of those who miss payments, and subsequently may be predisposed to default on their loans. Another surprise was finding how similar most borrower with unfavorable loan statuses are to the rest (I found this with code that can be found in my brainstorm section below) Although not exactly similar to the rest of the borrowers, our borrowers in the csv_file.drops data set, share a lot of characteristics with their counter parts for example: their high and low credit score ranges had a similiar variation, their employment status durations were very similar and so was the spread of original loan amounts. There are even more statistical comparison that can be seen below.

Obviously, there is a lot more relationships that are hidden behind this data, and even more if one is to look at the entire original dataset. Some of the other variables, and impact on loan status, I would have liked to investigate are “BorrowerAPR” and “BorrowerRate”; these two could potentially be some of the biggest factors in predicting whether a loan will be defaulted on or not.

Brainstorm

Everything below this point was part of all roughs drafts I previously made. I kept all of this here in case it would come in handy later on, not just the code but the ideas as well.

ggplot(aes(x = BorrowerState, y = LoanOriginalAmount), data = 
         csv_file.reassesedLSandEmp)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

ggplot(aes(x = Occupation, y = LoanOriginalAmount), data = 
         csv_file.reassesedLSandEmp)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

How about switching our continous variable with others?

CreditScoreRangeLower with categorical variable variations

## CreditScoreRangeLower vs BorrowerStates 
ggplot(aes(x = BorrowerState, y = CreditScoreRangeLower), data = 
         csv_file.reassesedLSandEmp)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

## CreditScoreRangeLower vs Occupation 
ggplot(aes(x = Occupation, y = CreditScoreRangeLower), data = 
         csv_file.reassesedLSandEmp)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

It is easy now to spot the variability of The lowest reported credit scores for all employed customers who have unfavorable loan statuses by their state of residence or by their profession. The ‘CreditScoreRangeLower’ vs ‘BorrowerState’ reveals that Idaho, Oklahoma and Wisconsin are some of the states with the largest variability. Alaska possess the lowest variability. Finally, Wyoming presents itself with the same variability of other states but at an overall higher range than others. In other words, Wyomings’ borrower’s credit scores tend to be higher than most others.

The “Occupation” vs “CreditScoreRangeLower” demonstrates that among those employed borrowers with unfavorable loan statuses, borrowers who are mdeical technicians have some of the highest spreads of Lower range credit scores. Whereas chemists, judges, borrower’s who work in the landscaping business and college freshmen/junior/Sophomore are amongst those with the lowest spread.

CreditScoreRangeUpper with categorical variable variations

## CreditScoreRangeUpper vs BorrowerStates 
ggplot(aes(x = BorrowerState, y = CreditScoreRangeUpper), data = 
         csv_file.reassesedLSandEmp)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

## CreditScoreRangeUpper vs Occupation 
ggplot(aes(x = Occupation, y = CreditScoreRangeUpper), data = 
         csv_file.reassesedLSandEmp)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))

Just like we saw before WY has some of the best credit scores, in this case, the upper range. Like we noticed before ID, OK and WI have some of the greatest spread, but unlike before we notice that CT also possess a higher than normal spread, in the upper range of course.

At this point we have seen some of the few combination of interaction among a few variables of the latest subset of our original dataset.

Comparison

csv_file.counterDrops will be a new subset of our orginal data set in which you can find all the rows that are NOT in the csv_file.drops subset, but it will only contain the same columns as does csv_file.drops, this will help compare our specific subjects of interest againt the larger majority that is not employed AND possess an unfavorable loan status.

# make csv_file.counterDrops

# we are going to write csv_file2, the file against which we will compare 
# csv_file.drops in order to find those rows the belong in 
# csv_file.counterdrops, but we will do so, in the same column order as
# csv_file.drops

csv_file.counterDrops <- csv_file[,c("LoanStatus",
                  "EmploymentStatus",
                  "FirstRecordedCreditLine",
                  "EmploymentStatusDuration",
                  "OpenCreditLines",
                  "DelinquenciesLast7Years",
                  "CreditScoreRange",
                  "CreditScoreRangeLower",
                  "CreditScoreRangeUpper",
                  "TotalCreditLinespast7years",
                  "TotalTrades",
                  "CurrentDelinquencies",
                  "TradesOpenedLast6Months",
                  "StatedMonthlyIncome",
                  "CurrentCreditLines",
                  "AmountDelinquent",
                  "DebtToIncomeRatio",
                  "LoanOriginalAmount")]

csv_file.drops <- csv_file.drops[,c("LoanStatus",
                  "EmploymentStatus",
                  "FirstRecordedCreditLine",
                  "EmploymentStatusDuration",
                  "OpenCreditLines",
                  "DelinquenciesLast7Years",
                  "CreditScoreRange",
                  "CreditScoreRangeLower",
                  "CreditScoreRangeUpper",
                  "TotalCreditLinespast7years",
                  "TotalTrades",
                  "CurrentDelinquencies",
                  "TradesOpenedLast6Months",
                  "StatedMonthlyIncome",
                  "CurrentCreditLines",
                  "AmountDelinquent",
                  "DebtToIncomeRatio",
                  "LoanOriginalAmount")]

csv_file.counterDrops <- 
  csv_file.counterDrops[which(
    csv_file.counterDrops$EmploymentStatus != "Employed"&(
    csv_file.counterDrops$LoanStatus !='Past Due (1-15 days)'|
    csv_file.counterDrops$LoanStatus !='Past Due (16-30 days)'|
    csv_file.counterDrops$LoanStatus !='Past Due (31-60 days)'|
    csv_file.counterDrops$LoanStatus !='Past Due (61-90 days)'|
    csv_file.counterDrops$LoanStatus !='Past Due (91-120 days)'|
    csv_file.counterDrops$LoanStatus !='Past Due (>120 days)'|
    csv_file.counterDrops$LoanStatus !='Defaulted')),]

unique(csv_file.counterDrops$EmploymentStatus)
## [1] Self-employed Not available Full-time     Other                      
## [6] Not employed  Part-time     Retired      
## 9 Levels:  Employed Full-time Not available Not employed ... Self-employed
unique(csv_file.counterDrops$LoanStatus)
##  [1] Completed              Defaulted              Current               
##  [4] Chargedoff             Past Due (1-15 days)   Cancelled             
##  [7] Past Due (16-30 days)  Past Due (31-60 days)  FinalPaymentInProgress
## [10] Past Due (61-90 days)  Past Due (91-120 days) Past Due (>120 days)  
## 12 Levels: Cancelled Chargedoff Completed Current ... Past Due (91-120 days)
names(csv_file.counterDrops)
##  [1] "LoanStatus"                 "EmploymentStatus"          
##  [3] "FirstRecordedCreditLine"    "EmploymentStatusDuration"  
##  [5] "OpenCreditLines"            "DelinquenciesLast7Years"   
##  [7] "CreditScoreRange"           "CreditScoreRangeLower"     
##  [9] "CreditScoreRangeUpper"      "TotalCreditLinespast7years"
## [11] "TotalTrades"                "CurrentDelinquencies"      
## [13] "TradesOpenedLast6Months"    "StatedMonthlyIncome"       
## [15] "CurrentCreditLines"         "AmountDelinquent"          
## [17] "DebtToIncomeRatio"          "LoanOriginalAmount"
summary(csv_file.counterDrops)
##                  LoanStatus         EmploymentStatus
##  Completed            :25742   Full-time    :26355  
##  Chargedoff           : 8463   Self-employed: 6134  
##  Current              : 7647   Not available: 5347  
##  Defaulted            : 4388   Other        : 3806  
##  Past Due (1-15 days) :  131                : 2255  
##  Past Due (31-60 days):   70   Part-time    : 1088  
##  (Other)              :  174   (Other)      : 1630  
##  FirstRecordedCreditLine EmploymentStatusDuration OpenCreditLines 
##  Min.   :1947-08-24      Min.   :  0.00           Min.   : 0.000  
##  1st Qu.:1989-11-01      1st Qu.: 19.00           1st Qu.: 5.000  
##  Median :1995-03-27      Median : 51.00           Median : 8.000  
##  Mean   :1994-02-21      Mean   : 83.29           Mean   : 8.555  
##  3rd Qu.:1999-08-19      3rd Qu.:114.00           3rd Qu.:11.000  
##  Max.   :2012-02-14      Max.   :755.00           Max.   :51.000  
##  NA's   :697             NA's   :7624             NA's   :7604    
##  DelinquenciesLast7Years CreditScoreRange CreditScoreRangeLower
##  Min.   : 0.000          Min.   :19       Min.   :  0.0        
##  1st Qu.: 0.000          1st Qu.:19       1st Qu.:620.0        
##  Median : 0.000          Median :19       Median :660.0        
##  Mean   : 4.784          Mean   :19       Mean   :665.4        
##  3rd Qu.: 4.000          3rd Qu.:19       3rd Qu.:720.0        
##  Max.   :99.000          Max.   :19       Max.   :880.0        
##  NA's   :990             NA's   :591      NA's   :591          
##  CreditScoreRangeUpper TotalCreditLinespast7years  TotalTrades    
##  Min.   : 19.0         Min.   :  2.0              Min.   :  0.00  
##  1st Qu.:639.0         1st Qu.: 15.0              1st Qu.: 13.00  
##  Median :679.0         Median : 23.0              Median : 20.00  
##  Mean   :684.4         Mean   : 25.1              Mean   : 21.71  
##  3rd Qu.:739.0         3rd Qu.: 33.0              3rd Qu.: 29.00  
##  Max.   :899.0         Max.   :136.0              Max.   :126.00  
##  NA's   :591           NA's   :697                NA's   :7544    
##  CurrentDelinquencies TradesOpenedLast6Months StatedMonthlyIncome
##  Min.   : 0.0000      Min.   : 0.000          Min.   :      0    
##  1st Qu.: 0.0000      1st Qu.: 0.000          1st Qu.:   2550    
##  Median : 0.0000      Median : 0.000          Median :   4000    
##  Mean   : 0.9981      Mean   : 0.876          Mean   :   4842    
##  3rd Qu.: 1.0000      3rd Qu.: 1.000          3rd Qu.:   5833    
##  Max.   :83.0000      Max.   :17.000          Max.   :1750003    
##  NA's   :697          NA's   :7544                               
##  CurrentCreditLines AmountDelinquent DebtToIncomeRatio LoanOriginalAmount
##  Min.   : 0.000     Min.   :     0   Min.   : 0.000    Min.   : 1000     
##  1st Qu.: 6.000     1st Qu.:     0   1st Qu.: 0.130    1st Qu.: 2550     
##  Median : 9.000     Median :     0   Median : 0.210    Median : 4500     
##  Mean   : 9.782     Mean   :  1026   Mean   : 0.313    Mean   : 6233     
##  3rd Qu.:13.000     3rd Qu.:     0   3rd Qu.: 0.320    3rd Qu.: 8000     
##  Max.   :52.000     Max.   :444745   Max.   :10.010    Max.   :35000     
##  NA's   :7604       NA's   :7622     NA's   :7128
summary(csv_file.drops)
##                   LoanStatus       EmploymentStatus
##  Past Due (1-15 days)  :675   Employed     :2355   
##  Defaulted             :630                :   0   
##  Past Due (31-60 days) :293   Full-time    :   0   
##  Past Due (61-90 days) :262   Not available:   0   
##  Past Due (91-120 days):255   Not employed :   0   
##  Past Due (16-30 days) :226   Other        :   0   
##  (Other)               : 14   (Other)      :   0   
##  FirstRecordedCreditLine EmploymentStatusDuration OpenCreditLines 
##  Min.   :1958-01-01      Min.   :  0.00           Min.   : 0.000  
##  1st Qu.:1991-04-27      1st Qu.: 28.00           1st Qu.: 6.000  
##  Median :1996-08-06      Median : 69.00           Median : 8.000  
##  Mean   :1995-05-29      Mean   : 94.99           Mean   : 9.008  
##  3rd Qu.:2000-05-24      3rd Qu.:135.50           3rd Qu.:12.000  
##  Max.   :2011-04-19      Max.   :527.00           Max.   :33.000  
##                                                                   
##  DelinquenciesLast7Years CreditScoreRange CreditScoreRangeLower
##  Min.   : 0.000          Min.   :19       Min.   :600.0        
##  1st Qu.: 0.000          1st Qu.:19       1st Qu.:660.0        
##  Median : 0.000          Median :19       Median :680.0        
##  Mean   : 4.409          Mean   :19       Mean   :689.8        
##  3rd Qu.: 3.000          3rd Qu.:19       3rd Qu.:720.0        
##  Max.   :89.000          Max.   :19       Max.   :860.0        
##                                                                
##  CreditScoreRangeUpper TotalCreditLinespast7years  TotalTrades   
##  Min.   :619.0         Min.   : 2.00              Min.   : 1.00  
##  1st Qu.:679.0         1st Qu.:18.00              1st Qu.:15.00  
##  Median :699.0         Median :26.00              Median :22.00  
##  Mean   :708.8         Mean   :27.72              Mean   :23.33  
##  3rd Qu.:739.0         3rd Qu.:35.00              3rd Qu.:30.00  
##  Max.   :879.0         Max.   :97.00              Max.   :75.00  
##                                                                  
##  CurrentDelinquencies TradesOpenedLast6Months StatedMonthlyIncome
##  Min.   : 0.0000      Min.   :0.0000          Min.   :    0.25   
##  1st Qu.: 0.0000      1st Qu.:0.0000          1st Qu.: 3166.67   
##  Median : 0.0000      Median :1.0000          Median : 4583.33   
##  Mean   : 0.4841      Mean   :0.9359          Mean   : 5320.30   
##  3rd Qu.: 0.0000      3rd Qu.:1.0000          3rd Qu.: 6585.00   
##  Max.   :17.0000      Max.   :8.0000          Max.   :33333.33   
##                                                                  
##  CurrentCreditLines AmountDelinquent DebtToIncomeRatio LoanOriginalAmount
##  Min.   : 0.0       Min.   :     0   Min.   : 0.0100   Min.   : 1000     
##  1st Qu.: 6.0       1st Qu.:     0   1st Qu.: 0.1600   1st Qu.: 4000     
##  Median : 9.0       Median :     0   Median : 0.2300   Median : 6000     
##  Mean   :10.1       Mean   :  1417   Mean   : 0.3009   Mean   : 8051     
##  3rd Qu.:13.0       3rd Qu.:     0   3rd Qu.: 0.3500   3rd Qu.:10000     
##  Max.   :34.0       Max.   :183396   Max.   :10.0100   Max.   :35000     
##                                      NA's   :85

The point of this last section was to compare our subjects of interest to their counterparts in a more detailed aspect. More information can be derived from these summaries alone.